Replace or Retrieve Keywords In Documents at Scale

Author

  • Vikash Singh
Abstract

In this paper we introduce the FlashText¹ algorithm for replacing or finding keywords in a given text. FlashText can search for or replace keywords in one pass over a document. The time complexity of this algorithm does not depend on the number of terms being searched or replaced. For a document of size N (characters) and a dictionary of M keywords, the time complexity is O(N). This algorithm is much faster than regex (see Figures 1 and 2), because regex time complexity is O(M * N). It also differs from the Aho-Corasick algorithm³ in that it doesn't match substrings. FlashText is designed to match only complete words (words with boundary characters² on both sides). For an input dictionary of {Apple}, this algorithm won't match it to 'I like Pineapple'. The algorithm is also designed to go for the longest match first. For an input dictionary {Machine, Learning, Machine learning} and the string 'I like Machine learning', it will only consider the longest match, which is Machine learning. We have made a Python implementation of this algorithm available as open source on GitHub,¹ released under the permissive MIT License.

Subjects: Data Structures and Algorithms (cs.DS)

Keywords: Information retrieval, Keyword Search, Regex, Keyword Replace, FlashText
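The whole-word, longest-match behaviour described in the abstract can be illustrated with a small character trie. The following is a minimal sketch, not the author's implementation: the names `build_trie`, `extract_keywords`, `WORD_CHARS` and `END` are hypothetical, the boundary rule (anything other than an ASCII letter, digit or underscore) is an assumption, and matching is assumed to be case-insensitive. The sketch restarts the trie walk at each word start, so it does not reproduce the paper's single-pass bookkeeping; it only shows that the work per position does not grow with the number of keywords.

```python
import string

# Assumption: word characters are ASCII letters, digits and underscore;
# every other character counts as a boundary.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")
END = "_end_"  # hypothetical sentinel key; multi-character, so it never clashes with a single character


def build_trie(keywords):
    """Build a case-insensitive character trie from an iterable of keywords."""
    root = {}
    for keyword in keywords:
        node = root
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[END] = keyword  # remember the original spelling at the terminal node
    return root


def extract_keywords(text, trie):
    """Return keywords found in text as whole words, preferring the longest match."""
    found = []
    i, n = 0, len(text)
    while i < n:
        # A match may only begin at the first character of a word.
        if text[i] not in WORD_CHARS or (i > 0 and text[i - 1] in WORD_CHARS):
            i += 1
            continue
        node, j = trie, i
        best_keyword, best_end = None, i
        # Walk the trie as far as the text allows, recording every keyword
        # that ends on a boundary; the last one recorded is the longest.
        while j < n and text[j].lower() in node:
            node = node[text[j].lower()]
            j += 1
            if END in node and (j == n or text[j] not in WORD_CHARS):
                best_keyword, best_end = node[END], j
        if best_keyword is not None:
            found.append(best_keyword)
            i = best_end  # resume after the matched keyword
        else:
            i += 1  # the rest of this word is skipped by the boundary check
    return found


trie = build_trie(["Machine", "Learning", "Machine learning"])
print(extract_keywords("I like Machine learning", trie))  # ['Machine learning']
print(extract_keywords("I like Pineapple", build_trie(["Apple"])))  # []
```

In this sketch the cost of handling each word start is bounded by the length of the longest keyword, regardless of how many keywords the trie holds, which is the property the abstract contrasts with the O(M * N) regex approach.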

Similar resources

Search Result Clustering Method at NTCIR-5 Web Query Expansion Subtask

We use a retrieval system with search result clustering to tackle the NTCIR-5 WEB Query Term Expansion Subtask. The system clusters the search results in such a way as to make it easier for the user to select relevant documents as feedback documents. In addition, we select phrase words or named entities (NE) as query-expansion keywords from the feedback documents because these words tend to repr...

Semantic Retrieval System Based on Ontology

The recall factor is low when keywords are used to retrieve information, and many related documents are omitted. Semantic annotation is used to annotate documents to improve the recall factor. However, querying extremely large numbers of instances may crash the ABox reasoner. In this research, a method is proposed to improve the efficiency of semantic retrieving via combining ABox reasoning and datab...

Semantic Based Information Extraction from Web

Extracting information from the web is a challenging task. The information stored on the web may be structured or unstructured. The structured information provides enhanced knowledge which helps to retrieve relevant documents. It helps the user to understand a particular domain. This paper explores the importance of information extraction using semantics. It enables the users to discover...

Common Errors in the English Keywords of Articles in the Field of Medical Sciences Education

Background and purpose: Author-assigned keywords at the end of the abstracts in scientific articles are the words most relevant to the content of the article. They are the main sources for indexing and storing the articles in databases, and help to retrieve related articles. Therefore, any mistake or ambiguity in keywords leads to disruption of both data storage and retrieval processes. This stu...

A system for retrieving broadcast news speech documents using voice input keywords and similarity between words

This paper describes a robust speech documents retrieval system that uses voice input keywords. To solve the inevitable problems which arise when the input to the system is speech, i.e. misrecognition, a novel method was developed, where, before the retrieval processing, unproductive keyword candidates are discarded by a grouping processing using the similarity between words and the recognition...

Journal:
  • CoRR

Volume: abs/1711.00046

Publication date: 2017